Paragraph vector based topic model for language model adaptation
Authors
Abstract
Topic models are an important approach to language model (LM) adaptation and have attracted research interest for a long time. Latent Dirichlet Allocation (LDA), which assumes a generative Dirichlet distribution over hidden topics with bag-of-words features, has been widely used as the state-of-the-art topic model. Inspired by the recent development of a new paradigm of distributed paragraph representation called the paragraph vector, a new topic model based on paragraph vectors is proposed in this work. During training, each paragraph is mapped to a unique vector in continuous space, and unsupervised clustering is then performed on these vectors to construct topic clusters. A topic-specific LM is then built for each cluster. During adaptation, the topic posterior is first estimated using the paragraph vector based topic model, and new adapted LMs are constructed by interpolating the existing topic-specific models with the topic posteriors. The proposed topic model is applied to N-gram LM adaptation and evaluated on the Amazon Product Review Corpus for perplexity and on a Chinese LVCSR task for character error rate (CER). Results show that the proposed approach yields an 11.1% relative perplexity reduction and a 1.4% relative CER reduction over the N-gram baseline, outperforming the LDA-based method proposed in previous work.
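As a rough sketch of the pipeline described in the abstract, the following Python code uses gensim's Doc2Vec as the paragraph vector model and k-means for the unsupervised topic clustering. The softmax-over-distance topic posterior and the toy unigram topic LMs are illustrative simplifications, not the paper's exact formulation (which builds N-gram topic LMs).

from collections import Counter

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

paragraphs = [
    "the camera battery lasts all day".split(),
    "battery life on this phone is great".split(),
    "the plot of the novel was gripping".split(),
    "a well written book with a weak ending".split(),
]

# 1) Map each paragraph to a unique vector in continuous space.
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(paragraphs)]
pv = Doc2Vec(tagged, vector_size=16, window=3, min_count=1, epochs=50)
vectors = np.vstack([pv.dv[i] for i in range(len(paragraphs))])

# 2) Unsupervised clustering of paragraph vectors into topic clusters.
n_topics = 2
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(vectors)

# 3) Build a topic-specific LM per cluster (unigram counts here for brevity).
topic_lms = []
for t in range(n_topics):
    counts = Counter()
    for words, label in zip(paragraphs, km.labels_):
        if label == t:
            counts.update(words)
    total = sum(counts.values())
    topic_lms.append({w: c / total for w, c in counts.items()})

def topic_posterior(words):
    """Soft topic assignment for new text: softmax over negative
    distances to the cluster centroids (an illustrative choice)."""
    v = pv.infer_vector(words)
    d = np.linalg.norm(km.cluster_centers_ - v, axis=1)
    e = np.exp(-d)
    return e / e.sum()

def adapted_prob(word, posterior, floor=1e-6):
    """Adapted LM: interpolate topic-specific LMs with topic posteriors."""
    return sum(p * lm.get(word, floor) for p, lm in zip(posterior, topic_lms))

post = topic_posterior("the battery drains quickly".split())
print(post, adapted_prob("battery", post))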
Similar resources
A comparison of spectral methods for spoken language identification
Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional feature vector, and a statistical model of these features is then obtained for each language. The ...
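The spectral-based approach summarized above can be illustrated with a generic sketch: short-time spectral features (MFCCs via librosa) are modeled per language with a Gaussian mixture, and a test utterance is assigned to the language whose model scores it highest. The feature set and model choice are assumptions for illustration, not taken from this paper.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, n_mfcc=13):
    """Frame-level spectral feature vectors for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def train_language_models(train_files):
    """train_files: dict mapping language name -> list of wav paths."""
    models = {}
    for lang, paths in train_files.items():
        feats = np.vstack([mfcc_features(p) for p in paths])
        models[lang] = GaussianMixture(n_components=8, covariance_type="diag").fit(feats)
    return models

def identify(wav_path, models):
    feats = mfcc_features(wav_path)
    # Pick the language whose model gives the highest average frame log-likelihood.
    return max(models, key=lambda lang: models[lang].score(feats))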
Free Model of Sentence Classifier for Automatic Extraction of Topic Sentences
This research employs a free model that uses only sentential features, without paragraph context, to extract the topic sentences of a paragraph. To find the optimal combination of features, corpus-based classification is used to construct a sentence classifier as the model. The sentence classifier is trained using a Support Vector Machine (SVM). The experiment shows that position and meta-discours...
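A minimal sketch of such a sentence classifier, assuming simple sentential features (position in the paragraph and presence of meta-discourse cues, with a hypothetical cue list) and scikit-learn's SVM; the paper's actual feature set is not reproduced here.

from sklearn.svm import SVC

META_DISCOURSE_CUES = ("in conclusion", "this paper", "we propose", "in summary")

def sentence_features(sentence, index, n_sentences):
    position = index / max(n_sentences - 1, 1)          # 0.0 = first sentence, 1.0 = last
    has_cue = any(c in sentence.lower() for c in META_DISCOURSE_CUES)
    length = len(sentence.split())
    return [position, float(has_cue), length]

# Toy training data: (features, 1 if topic sentence else 0).
X = [
    sentence_features("We propose a new topic model for adaptation.", 0, 4),
    sentence_features("Details of the corpus are given below.", 2, 4),
    sentence_features("In conclusion, the method reduces perplexity.", 3, 4),
    sentence_features("The table lists hyper-parameters.", 1, 4),
]
y = [1, 0, 1, 0]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([sentence_features("We propose a joint vector model.", 0, 5)]))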
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text has motivated researchers to use...
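For reference, the TF-IDF baseline mentioned in this snippet can be sketched with scikit-learn: documents become TF-IDF vectors and are clustered with k-means. This shows only the word-based representation whose semantic limitations motivate the joint semantic vector model; the paper's own representation is not shown.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the phone battery lasts two days",
    "battery life and charging speed are excellent",
    "the novel has a gripping plot",
    "a beautifully written story with strong characters",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                 # sparse (n_docs, n_terms) matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)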
Generative Paragraph Vector
The recently introduced Paragraph Vector is an efficient method for learning high-quality distributed representations for pieces of text. However, an inherent limitation of Paragraph Vector is its inability to infer distributed representations for texts outside of the training set. To tackle this problem, we introduce the Generative Paragraph Vector, which can be viewed as a probabilistic exten...
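For context, here is a short sketch of the standard Paragraph Vector using gensim's Doc2Vec, including its gradient-based inference step for a paragraph outside the training set; the Generative Paragraph Vector replaces this with a probabilistic generative formulation, which is not reproduced here.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_texts = [
    "distributed representations of words and documents",
    "topic models for language model adaptation",
    "reviews of cameras phones and laptops",
]
tagged = [TaggedDocument(t.split(), [i]) for i, t in enumerate(train_texts)]
model = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=60)

# Infer a vector for an out-of-training paragraph and find the closest training one.
unseen = "adapting a language model with topic information".split()
vec = model.infer_vector(unseen)
print(model.dv.most_similar([vec], topn=1))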
Determination of weight vector by using a pairwise comparison matrix based on DEA and Shannon entropy
The relation between the analytic hierarchy process (AHP) and data envelopment analysis (DEA) is a topic of interest to researchers in this branch of applied mathematics. In this paper, we propose a linear programming model that generates a weight (priority) vector from a pairwise comparison matrix. In this method, which is referred to as the E-DEAHP method, we consider each row of the pairwise...
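As a small illustration of deriving a weight vector from a pairwise comparison matrix, the snippet below uses the classical principal-eigenvector method from AHP; the paper's E-DEAHP method instead uses a linear programming model, which is not reproduced here.

import numpy as np

# Reciprocal pairwise comparison matrix for three criteria (toy values).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(A)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
weights = principal / principal.sum()
print(weights)  # priority (weight) vector, normalized to sum to 1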